Just-in-time language modelling
Abstract
Traditional approaches to language modelling have relied on a fixed corpus of text to inform the parameters of a probability distribution over word sequences. Increasing the corpus size often leads to better-performing language models, but no matter how large, the corpus is a static entity, unable to reflect information about events which postdate it. In these pages we introduce an online paradigm which interleaves the estimation and application of a language model. We present a Bayesian approach to online language modelling, in which the marginal probabilities of a static trigram model are dynamically updated to match the topic being dictated to the system. We also describe the architecture of a prototype we have implemented which uses the World Wide Web (WWW) as a source of information, and provide results from some initial proof-of-concept experiments.

1. BACKGROUND

Of pressing concern to language modelling researchers is how to detect and account for a "non-stationary" source; that is, a source of words whose distribution changes over time. To take a concrete example, suppose we put an ASR system to the task of transcribing the evening news (to automatically generate a transcript for the hearing-impaired, say). The anchor may begin with a segment on Bosnia, in which words such as strife, famine, Serbs, Albright and U.N. are more probable than in general. Then the anchor moves to a story on the Iditarod dog sled race in Alaska, during which time the words snow, mush, cold and canine are more likely than in general. Adaptive language modelling addresses the task of ensuring that a model keeps up with a changing source distribution. In short: as the topic changes, so should the model. One approach has been to partition the training corpus into a number of topics (a coarse division might be sports, politics and useless banter) and train individual models on each topic.
When applying this composite model, one needs somehow to detect the topic at hand and select the model appropriate to the topic. A somewhat more refined approach is to allow a few topics to be active at once, and apply a weighted average of the individual topic models [1]. These approaches rely on a training corpus fixed "offline," prior to applying the model. Such an approach works well when the topic at hand is to be found in the training corpus, but not when the topic is absent from the corpus. This is not a problem that can be fixed with Moore's law and patience: nearly any source distribution which an ASR system is going to encounter will change as events occur in the world. No speech recognition system trained on data prior to 1993, for example, could possibly recognize that Marlins and baseball have a strong lexical correlation.

This document describes a language modelling system in which the estimation and application of the model are coupled; that is, a model which learns as it works. We envision the behavior of an ASR system incorporating this language model as follows (Figure 1). In processing a single utterance, the system uses the current language model to generate a hypothesis for the utterance. The hypothesis is then (while the system awaits the next utterance, say) sent to a query engine, which generates an update corpus by using the hypothesis as a set of keywords in a search. The language model is reestimated to take the update corpus into account, and can be applied either to rescore the current utterance, or simply to process the next utterance.

[Figure 1: ASR system using just-in-time language modelling. Block diagram: spoken input enters the ASR system, which emits a hypothesized utterance; the hypothesis serves as a query to a query engine over a text source (the WWW); the resulting update corpus feeds an update algorithm, which produces an updated dynamic language model.]

(Adam Berger is partially supported by an IBM Cooperative Fellowship. Robert Miller is partially supported by an NDSEG Fellowship.)
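The weighted-average scheme mentioned above can be sketched in miniature. The unigram topic models and the single-step posterior reweighting below are illustrative assumptions for this sketch, not the construction of [1]:

```python
# Sketch: adaptive mixture of topic language models. Each topic model is a
# toy unigram distribution; mixture weights are re-estimated from recently
# hypothesized words by one posterior-weighting (EM-like) step.

def mixture_prob(word, topic_models, weights):
    """P(word) under the weighted average of the topic models."""
    return sum(w * model.get(word, 1e-9)
               for w, model in zip(weights, topic_models))

def update_weights(recent_words, topic_models, weights):
    """Shift mixture weights toward topics that explain recent words."""
    new = [0.0] * len(weights)
    for word in recent_words:
        total = mixture_prob(word, topic_models, weights)
        for k, model in enumerate(topic_models):
            # posterior responsibility of topic k for this word
            new[k] += weights[k] * model.get(word, 1e-9) / total
    norm = sum(new)
    return [x / norm for x in new]

# Two toy topic models: a "sports" topic and a "politics" topic.
sports   = {"mush": 0.4, "snow": 0.4, "vote": 0.2}
politics = {"mush": 0.1, "vote": 0.8, "snow": 0.1}
weights = [0.5, 0.5]

weights = update_weights(["snow", "mush", "snow"], [sports, politics], weights)
# weights is now [0.8, 0.2]: the mixture has shifted toward the sports topic.
```

A real system would apply this update over a sliding window of decoded text, but the mechanics are the same: topics that assign higher probability to the recent hypothesis gain mixture weight.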
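The processing loop of Figure 1 might look as follows in outline. `ToyModel`, the candidate lists, and the query engine are all stand-ins invented for this sketch; a real system would use an actual recognizer and a web search:

```python
# Sketch of the just-in-time loop of Figure 1 (toy stand-ins throughout).

class ToyModel:
    """Toy unigram 'language model' over whitespace-separated words."""
    def __init__(self, counts):
        self.counts = counts
    def score(self, text):
        return sum(self.counts.get(w, 0) for w in text.split())
    def reestimate(self, update_corpus):
        # Fold update-corpus words into the counts (crude adaptation).
        new = dict(self.counts)
        for w in update_corpus.split():
            new[w] = new.get(w, 0) + 1
        return ToyModel(new)

def recognize(candidates, model):
    """Placeholder decode: pick the candidate the model scores highest."""
    return max(candidates, key=model.score)

def jit_transcribe(utterances, model, query_engine, rescore=True):
    transcript = []
    for candidates in utterances:
        hyp = recognize(candidates, model)      # decode with current model
        corpus = query_engine(hyp)              # hypothesis words as search query
        model = model.reestimate(corpus)        # fold in the update corpus
        if rescore:                             # optionally rescore the same
            hyp = recognize(candidates, model)  # utterance with the new model
        transcript.append(hyp)
    return transcript

# A fake query engine standing in for a WWW search.
query_engine = lambda q: "sled sled mush cold" if "race" in q else ""
model = ToyModel({"dog": 1, "said": 1, "race": 2, "cold": 1})
utterances = [["dog sled race", "dog said race"], ["much cold", "mush cold"]]
transcript = jit_transcribe(utterances, model, query_engine)
# Rescoring with the adapted model corrects "said" to "sled":
# transcript == ["dog sled race", "mush cold"]
```

The `rescore` flag captures the choice described above: the reestimated model can either redecode the current utterance or only benefit the next one.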
2. AN ONLINE MODEL OF LANGUAGE

Given is some stationary distribution, our default model. We imagine it to be a trigram or similar conventional language model, whose parameters have been estimated offline on a large corpus of text.¹ The system is occasionally provided with an "update corpus" of text which (one hopes) has a high semantic correlation with the current topic of dictation, but which is likely to be much smaller, by several orders of magnitude, than the training corpus. How one generates such a corpus is taken up in the following section. Our concern here is to construct a dynamic language model which incorporates knowledge gleaned from the update corpus into the default model. Some desired properties of the dynamic model:

(1) The influence which the update corpus exerts should increase with its size (number of words). That is, as its size increases, we should consider it a more reliable source of information, and pay it greater heed.

(2) But how much greater heed? There should be a "knob" to adjust how much we adapt the model as a function of the size of the update corpus.

We'll consider language models in the exponential family

¹We make no explicit assumptions about the form of the default model in what follows, though the prototype described later uses the trigram model of [2].
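Properties (1) and (2) can be illustrated with a toy unigram version of such an update. The log-linear interpolation and the saturating size-dependent exponent below are illustrative assumptions for this sketch, not the estimator the paper goes on to develop:

```python
import math
from collections import Counter

def adapt(default, update_words, gamma=0.5, half_life=100.0):
    """Tilt the default unigram model q toward the update corpus A via
    log-linear interpolation: p(w) proportional to q(w)**(1-s) * pA(w)**s.

    Property (1): the exponent s grows with the update-corpus size n = |A|.
    Property (2): gamma is the "knob" bounding how large s may become.
    Words outside the default vocabulary are ignored in this sketch.
    """
    n = len(update_words)
    s = gamma * n / (n + half_life)       # saturating weight in [0, gamma)
    counts = Counter(update_words)
    vocab_size = len(default)

    def p_update(w):
        # Add-half smoothing so pA(w) is nonzero for every vocabulary word.
        return (counts[w] + 0.5) / (n + 0.5 * vocab_size)

    scores = {w: (q ** (1 - s)) * (p_update(w) ** s)
              for w, q in default.items()}
    z = sum(scores.values())
    return {w: v / z for w, v in scores.items()}

default = {"snow": 0.01, "vote": 0.05, "the": 0.94}
adapted = adapt(default, ["snow"] * 20 + ["vote"] * 2)
# "snow" gains probability mass; a larger update corpus shifts it further.
```

The knob `gamma` caps how far the model may drift from the default, while `half_life` sets how quickly the update corpus earns trust as it grows, so a handful of retrieved words perturbs the model only slightly.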